{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to TensorFlow Data Validation\n",
    "\n",
    "\n",
    "## Learning objectives\n",
    "\n",
    "1. Review TFDV methods.\n",
    "2. Generate statistics.\n",
    "3. Visualize statistics.\n",
    "4. Infer a schema.\n",
    "5. Update a schema.\n",
    "\n",
    "\n",
    "\n",
    "## Introduction \n",
    "This lab is an introduction to TensorFlow Data Validation (TFDV), a key component of TensorFlow Extended.  This lab serves as a foundation for understanding the features of TFDV and how it can help you understand, validate, and monitor your data. \n",
    "\n",
    "TFDV can be used for generating schemas and statistics about the distribution of every feature in the dataset. Such information is useful for comparing multiple datasets (e.g. training vs inference datasets) and reporting:\n",
    "\n",
    "Statistical differences in the features distribution\n",
    "TFDV also offers visualization capabilities for comparing datasets based on the Google PAIR Facets project.  \n",
    "\n",
   "Each learning objective will correspond to a __#TODO__ in this student lab notebook -- try to complete this notebook first and then review the [solution notebook](../solutions/tfdv_basic_spending.ipynb).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "AsHg6SD2nO1v"
   },
   "source": [
    "### Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: pyarrow in /opt/conda/lib/python3.7/site-packages (6.0.1)\n",
      "Requirement already satisfied: numpy>=1.16.6 in /opt/conda/lib/python3.7/site-packages (from pyarrow) (1.21.6)\n",
      "Requirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (1.21.6)\n",
      "Collecting tensorflow-data-validation\n",
      "  Downloading tensorflow_data_validation-1.12.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (18.2 MB)\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m18.2/18.2 MB\u001b[0m \u001b[31m56.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hRequirement already satisfied: absl-py<2.0.0,>=0.9 in /opt/conda/lib/python3.7/site-packages (from tensorflow-data-validation) (1.4.0)\n",
      "Requirement already satisfied: apache-beam[gcp]<3,>=2.40 in /opt/conda/lib/python3.7/site-packages (from tensorflow-data-validation) (2.46.0)\n",
      "Requirement already satisfied: joblib>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow-data-validation) (1.2.0)\n",
      "Requirement already satisfied: numpy<2,>=1.16 in /opt/conda/lib/python3.7/site-packages (from tensorflow-data-validation) (1.21.6)\n",
      "Requirement already satisfied: pandas<2,>=1.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow-data-validation) (1.3.5)\n",
      "Requirement already satisfied: protobuf<4,>=3.13 in /opt/conda/lib/python3.7/site-packages (from tensorflow-data-validation) (3.19.6)\n",
      "Requirement already satisfied: pyarrow<7,>=6 in /opt/conda/lib/python3.7/site-packages (from tensorflow-data-validation) (6.0.1)\n",
      "Collecting pyfarmhash<0.4,>=0.2 (from tensorflow-data-validation)\n",
      "  Downloading pyfarmhash-0.3.2.tar.gz (99 kB)\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m99.9/99.9 kB\u001b[0m \u001b[31m8.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[?25h  Preparing metadata (setup.py) ... \u001b[?25ldone\n",
      "\u001b[?25hRequirement already satisfied: six<2,>=1.12 in /opt/conda/lib/python3.7/site-packages (from tensorflow-data-validation) (1.16.0)\n",
      "Requirement already satisfied: tensorflow<3,>=2.11 in /opt/conda/lib/python3.7/site-packages (from tensorflow-data-validation) (2.11.0)\n",
      "Requirement already satisfied: tensorflow-metadata<1.13,>=1.12.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow-data-validation) (1.12.0)\n",
      "Requirement already satisfied: tfx-bsl<1.13,>=1.12.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow-data-validation) (1.12.0)\n",
      "Requirement already satisfied: crcmod<2.0,>=1.7 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (1.7)\n",
      "Requirement already satisfied: orjson<4.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (3.8.11)\n",
      "Requirement already satisfied: dill<0.3.2,>=0.3.1.1 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (0.3.1.1)\n",
      "Requirement already satisfied: cloudpickle~=2.2.1 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (2.2.1)\n",
      "Requirement already satisfied: fastavro<2,>=0.23.6 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (1.7.3)\n",
      "Requirement already satisfied: fasteners<1.0,>=0.3 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (0.18)\n",
      "Requirement already satisfied: grpcio!=1.48.0,<2,>=1.33.1 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (1.48.1)\n",
      "Requirement already satisfied: hdfs<3.0.0,>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (2.7.0)\n",
      "Requirement already satisfied: httplib2<0.22.0,>=0.8 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (0.21.0)\n",
      "Requirement already satisfied: objsize<0.7.0,>=0.6.1 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (0.6.1)\n",
      "Requirement already satisfied: pymongo<4.0.0,>=3.8.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (3.13.0)\n",
      "Requirement already satisfied: proto-plus<2,>=1.7.1 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (1.22.2)\n",
      "Requirement already satisfied: pydot<2,>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (1.4.2)\n",
      "Requirement already satisfied: python-dateutil<3,>=2.8.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (2.8.2)\n",
      "Requirement already satisfied: pytz>=2018.3 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (2023.3)\n",
      "Requirement already satisfied: regex>=2020.6.8 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (2022.10.31)\n",
      "Requirement already satisfied: requests<3.0.0,>=2.24.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (2.28.2)\n",
      "Requirement already satisfied: typing-extensions>=3.7.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (4.5.0)\n",
      "Requirement already satisfied: zstandard<1,>=0.18.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (0.18.0)\n",
      "Requirement already satisfied: cachetools<5,>=3.1.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (4.2.4)\n",
      "Requirement already satisfied: google-apitools<0.5.32,>=0.5.31 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (0.5.31)\n",
      "Requirement already satisfied: google-auth<3,>=1.18.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (2.17.3)\n",
      "Requirement already satisfied: google-auth-httplib2<0.2.0,>=0.1.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (0.1.0)\n",
      "Requirement already satisfied: google-cloud-datastore<2,>=1.8.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (1.15.5)\n",
      "Requirement already satisfied: google-cloud-pubsub<3,>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (2.16.0)\n",
      "Requirement already satisfied: google-cloud-pubsublite<2,>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (1.7.0)\n",
      "Requirement already satisfied: google-cloud-bigquery<4,>=1.6.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (3.10.0)\n",
      "Requirement already satisfied: google-cloud-bigquery-storage<2.17,>=2.6.3 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (2.16.2)\n",
      "Requirement already satisfied: google-cloud-core<3,>=0.28.1 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (2.3.2)\n",
      "Requirement already satisfied: google-cloud-bigtable<2,>=0.31.1 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (1.7.3)\n",
      "Requirement already satisfied: google-cloud-spanner<4,>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (3.32.0)\n",
      "Requirement already satisfied: google-cloud-dlp<4,>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (3.12.1)\n",
      "Requirement already satisfied: google-cloud-language<2,>=1.3.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (1.3.2)\n",
      "Requirement already satisfied: google-cloud-videointelligence<2,>=1.8.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (1.16.3)\n",
      "Requirement already satisfied: google-cloud-vision<4,>=2 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (3.4.1)\n",
      "Requirement already satisfied: google-cloud-recommendations-ai<0.8.0,>=0.1.0 in /opt/conda/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (0.7.1)\n",
      "Requirement already satisfied: astunparse>=1.6.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (1.6.3)\n",
      "Requirement already satisfied: flatbuffers>=2.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (23.3.3)\n",
      "Requirement already satisfied: gast<=0.4.0,>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (0.4.0)\n",
      "Requirement already satisfied: google-pasta>=0.1.1 in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (0.2.0)\n",
      "Requirement already satisfied: h5py>=2.9.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (3.8.0)\n",
      "Requirement already satisfied: keras<2.12,>=2.11.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (2.11.0)\n",
      "Requirement already satisfied: libclang>=13.0.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (16.0.0)\n",
      "Requirement already satisfied: opt-einsum>=2.3.2 in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (3.3.0)\n",
      "Requirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (23.1)\n",
      "Requirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (67.7.2)\n",
      "Requirement already satisfied: tensorboard<2.12,>=2.11 in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (2.11.2)\n",
      "Requirement already satisfied: tensorflow-estimator<2.12,>=2.11.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (2.11.0)\n",
      "Requirement already satisfied: termcolor>=1.1.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (2.3.0)\n",
      "Requirement already satisfied: wrapt>=1.11.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (1.15.0)\n",
      "Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /opt/conda/lib/python3.7/site-packages (from tensorflow<3,>=2.11->tensorflow-data-validation) (0.29.0)\n",
      "Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow-metadata<1.13,>=1.12.0->tensorflow-data-validation) (1.56.4)\n",
      "Requirement already satisfied: google-api-python-client<2,>=1.7.11 in /opt/conda/lib/python3.7/site-packages (from tfx-bsl<1.13,>=1.12.0->tensorflow-data-validation) (1.8.0)\n",
      "Requirement already satisfied: tensorflow-serving-api!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,<3,>=1.15 in /opt/conda/lib/python3.7/site-packages (from tfx-bsl<1.13,>=1.12.0->tensorflow-data-validation) (2.11.0)\n",
      "Requirement already satisfied: wheel<1.0,>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from astunparse>=1.6.0->tensorflow<3,>=2.11->tensorflow-data-validation) (0.40.0)\n",
      "Requirement already satisfied: google-api-core<2dev,>=1.13.0 in /opt/conda/lib/python3.7/site-packages (from google-api-python-client<2,>=1.7.11->tfx-bsl<1.13,>=1.12.0->tensorflow-data-validation) (1.34.0)\n",
      "Requirement already satisfied: uritemplate<4dev,>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from google-api-python-client<2,>=1.7.11->tfx-bsl<1.13,>=1.12.0->tensorflow-data-validation) (3.0.1)\n",
      "Requirement already satisfied: oauth2client>=1.4.12 in /opt/conda/lib/python3.7/site-packages (from google-apitools<0.5.32,>=0.5.31->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (4.1.3)\n",
      "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from google-auth<3,>=1.18.0->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (0.2.7)\n",
      "Requirement already satisfied: rsa<5,>=3.1.4 in /opt/conda/lib/python3.7/site-packages (from google-auth<3,>=1.18.0->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (4.9)\n",
      "Requirement already satisfied: google-resumable-media<3.0dev,>=0.6.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-bigquery<4,>=1.6.0->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (2.5.0)\n",
      "Requirement already satisfied: grpc-google-iam-v1<0.13dev,>=0.12.3 in /opt/conda/lib/python3.7/site-packages (from google-cloud-bigtable<2,>=0.31.1->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (0.12.6)\n",
      "Requirement already satisfied: grpcio-status>=1.33.2 in /opt/conda/lib/python3.7/site-packages (from google-cloud-pubsub<3,>=2.1.0->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (1.48.1)\n",
      "Requirement already satisfied: overrides<7.0.0,>=6.0.1 in /opt/conda/lib/python3.7/site-packages (from google-cloud-pubsublite<2,>=1.2.0->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (6.5.0)\n",
      "Requirement already satisfied: sqlparse>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-spanner<4,>=3.0.0->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (0.4.4)\n",
      "Requirement already satisfied: docopt in /opt/conda/lib/python3.7/site-packages (from hdfs<3.0.0,>=2.1.0->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (0.6.2)\n",
      "Requirement already satisfied: pyparsing!=3.0.0,!=3.0.1,!=3.0.2,!=3.0.3,<4,>=2.4.2 in /opt/conda/lib/python3.7/site-packages (from httplib2<0.22.0,>=0.8->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (3.0.9)\n",
      "Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.24.0->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (2.1.1)\n",
      "Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.24.0->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (3.4)\n",
      "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.24.0->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (1.26.15)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.24.0->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (2022.12.7)\n",
      "Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /opt/conda/lib/python3.7/site-packages (from tensorboard<2.12,>=2.11->tensorflow<3,>=2.11->tensorflow-data-validation) (0.4.6)\n",
      "Requirement already satisfied: markdown>=2.6.8 in /opt/conda/lib/python3.7/site-packages (from tensorboard<2.12,>=2.11->tensorflow<3,>=2.11->tensorflow-data-validation) (3.4.3)\n",
      "Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /opt/conda/lib/python3.7/site-packages (from tensorboard<2.12,>=2.11->tensorflow<3,>=2.11->tensorflow-data-validation) (0.6.1)\n",
      "Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /opt/conda/lib/python3.7/site-packages (from tensorboard<2.12,>=2.11->tensorflow<3,>=2.11->tensorflow-data-validation) (1.8.1)\n",
      "Requirement already satisfied: werkzeug>=1.0.1 in /opt/conda/lib/python3.7/site-packages (from tensorboard<2.12,>=2.11->tensorflow<3,>=2.11->tensorflow-data-validation) (2.1.2)\n",
      "Requirement already satisfied: requests-oauthlib>=0.7.0 in /opt/conda/lib/python3.7/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.12,>=2.11->tensorflow<3,>=2.11->tensorflow-data-validation) (1.3.1)\n",
      "Requirement already satisfied: google-crc32c<2.0dev,>=1.0 in /opt/conda/lib/python3.7/site-packages (from google-resumable-media<3.0dev,>=0.6.0->google-cloud-bigquery<4,>=1.6.0->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (1.5.0)\n",
      "Requirement already satisfied: importlib-metadata>=4.4 in /opt/conda/lib/python3.7/site-packages (from markdown>=2.6.8->tensorboard<2.12,>=2.11->tensorflow<3,>=2.11->tensorflow-data-validation) (4.11.4)\n",
      "Requirement already satisfied: pyasn1>=0.1.7 in /opt/conda/lib/python3.7/site-packages (from oauth2client>=1.4.12->google-apitools<0.5.32,>=0.5.31->apache-beam[gcp]<3,>=2.40->tensorflow-data-validation) (0.4.8)\n",
      "Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard<2.12,>=2.11->tensorflow<3,>=2.11->tensorflow-data-validation) (3.15.0)\n",
      "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.12,>=2.11->tensorflow<3,>=2.11->tensorflow-data-validation) (3.2.2)\n",
      "Building wheels for collected packages: pyfarmhash\n",
      "  Building wheel for pyfarmhash (setup.py) ... \u001b[?25ldone\n",
      "\u001b[?25h  Created wheel for pyfarmhash: filename=pyfarmhash-0.3.2-cp37-cp37m-linux_x86_64.whl size=108592 sha256=2947f3092ba9f4185b3b0f75dd2e3fe7ba046245d0d82ff815284e907d5b13f1\n",
      "  Stored in directory: /home/jupyter/.cache/pip/wheels/53/58/7a/3b040f3a2ee31908e3be916e32660db6db53621ce6eba838dc\n",
      "Successfully built pyfarmhash\n",
      "Installing collected packages: pyfarmhash, tensorflow-data-validation\n",
      "Successfully installed pyfarmhash-0.3.2 tensorflow-data-validation-1.12.0\n"
     ]
    }
   ],
   "source": [
    "!pip install pyarrow\n",
    "!pip install numpy\n",
    "!pip install tensorflow-data-validation --user"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "AsHg6SD2nO1v"
   },
   "source": [
    "**Restart the kernel (Kernel > Restart kernel > Restart).**\n",
    "\n",
    "**Re-run the above cell and proceed further.**\n",
    "\n",
    "**Note: Please ignore any incompatibility warnings and errors.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-05-12 14:55:28.125612: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA\n",
      "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
      "2023-05-12 14:55:33.851632: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64\n",
      "2023-05-12 14:55:33.852624: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64\n",
      "2023-05-12 14:55:33.852640: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Installing TensorFlow Data Validation\n",
      "TFDV version: 1.12.0\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import tensorflow_data_validation as tfdv\n",
    "import sys\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "print('Installing TensorFlow Data Validation')\n",
    "!pip install -q tensorflow_data_validation[visualization]\n",
    "\n",
    "print('TFDV version: {}'.format(tfdv.version.__version__))\n",
    "# Confirm that we're using Python 3\n",
    "assert sys.version_info.major is 3, 'Oops, not running Python 3. Use Runtime > Change runtime type'\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Fnm6Mj3vTGLm"
   },
   "source": [
    "###  Load the Consumer Spending Dataset\n",
    "\n",
    "We will download our dataset from Google Cloud Storage. The columns in the dataset are:\n",
    "\n",
    "* 'Graduated': Whether or not the person is a college graduate\n",
    "* 'Work Experience': The number of years in the workforce\n",
    "* 'Family Size': The size of the family unit\n",
    "* 'Spending Score': The spending score for consumer spending"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Graduated</th>\n",
       "      <th>Profession</th>\n",
       "      <th>Work_Experience</th>\n",
       "      <th>Family_Size</th>\n",
       "      <th>Spending_Score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>No</td>\n",
       "      <td>Healthcare</td>\n",
       "      <td>1.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Yes</td>\n",
       "      <td>Engineer</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>Average</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Yes</td>\n",
       "      <td>Engineer</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Yes</td>\n",
       "      <td>Lawyer</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>High</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Yes</td>\n",
       "      <td>Entertainment</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6.0</td>\n",
       "      <td>High</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Graduated     Profession  Work_Experience  Family_Size Spending_Score\n",
       "0        No     Healthcare              1.0          4.0            Low\n",
       "1       Yes       Engineer              NaN          3.0        Average\n",
       "2       Yes       Engineer              1.0          1.0            Low\n",
       "3       Yes         Lawyer              0.0          2.0           High\n",
       "4       Yes  Entertainment              NaN          6.0           High"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Read a comma-separated values (csv) file into DataFrame.\n",
    "# TODO: Your code goes here\n",
    "score_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Graduated</th>\n",
       "      <th>Profession</th>\n",
       "      <th>Work_Experience</th>\n",
       "      <th>Family_Size</th>\n",
       "      <th>Spending_Score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>No</td>\n",
       "      <td>Doctor</td>\n",
       "      <td>0.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>Average</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Yes</td>\n",
       "      <td>Entertainment</td>\n",
       "      <td>1.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>Average</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>No</td>\n",
       "      <td>Lawyer</td>\n",
       "      <td>0.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Yes</td>\n",
       "      <td>Executive</td>\n",
       "      <td>1.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>High</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Yes</td>\n",
       "      <td>Artist</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>Average</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Graduated     Profession  Work_Experience  Family_Size Spending_Score\n",
       "0        No         Doctor              0.0          5.0        Average\n",
       "1       Yes  Entertainment              1.0          4.0        Average\n",
       "2        No         Lawyer              0.0          5.0            Low\n",
       "3       Yes      Executive              1.0          5.0           High\n",
       "4       Yes         Artist              1.0          2.0        Average"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Read a comma-separated values (csv) file into DataFrame.\n",
    "# TODO: Your code goes here\n",
    "score_test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 4000 entries, 0 to 3999\n",
      "Data columns (total 5 columns):\n",
      " #   Column           Non-Null Count  Dtype  \n",
      "---  ------           --------------  -----  \n",
      " 0   Graduated        3964 non-null   object \n",
      " 1   Profession       3944 non-null   object \n",
      " 2   Work_Experience  3589 non-null   float64\n",
      " 3   Family_Size      3831 non-null   float64\n",
      " 4   Spending_Score   4000 non-null   object \n",
      "dtypes: float64(2), object(3)\n",
      "memory usage: 156.4+ KB\n"
     ]
    }
   ],
   "source": [
    "score_train.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Review the methods present in TFDV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['CombinerStatsGenerator',\n",
       " 'CrossFeatureView',\n",
       " 'DatasetListView',\n",
       " 'DatasetView',\n",
       " 'DetectFeatureSkew',\n",
       " 'FeaturePath',\n",
       " 'FeatureView',\n",
       " 'GenerateStatistics',\n",
       " 'MergeDatasetFeatureStatisticsList',\n",
       " 'StatsOptions',\n",
       " 'TransformStatsGenerator',\n",
       " 'WriteStatisticsToBinaryFile',\n",
       " 'WriteStatisticsToRecordsAndBinaryFile',\n",
       " 'WriteStatisticsToTFRecord',\n",
       " '__builtins__',\n",
       " '__cached__',\n",
       " '__doc__',\n",
       " '__file__',\n",
       " '__loader__',\n",
       " '__name__',\n",
       " '__package__',\n",
       " '__path__',\n",
       " '__spec__',\n",
       " '__version__',\n",
       " 'anomalies',\n",
       " 'api',\n",
       " 'arrow',\n",
       " 'coders',\n",
       " 'compare_slices',\n",
       " 'constants',\n",
       " 'default_sharded_output_suffix',\n",
       " 'default_sharded_output_supported',\n",
       " 'display_anomalies',\n",
       " 'display_schema',\n",
       " 'experimental_get_feature_value_slicer',\n",
       " 'generate_dummy_schema_with_paths',\n",
       " 'generate_statistics_from_csv',\n",
       " 'generate_statistics_from_dataframe',\n",
       " 'generate_statistics_from_tfrecord',\n",
       " 'get_confusion_count_dataframes',\n",
       " 'get_domain',\n",
       " 'get_feature',\n",
       " 'get_feature_stats',\n",
       " 'get_match_stats_dataframe',\n",
       " 'get_skew_result_dataframe',\n",
       " 'get_slice_stats',\n",
       " 'get_statistics_html',\n",
       " 'infer_schema',\n",
       " 'load_anomalies_text',\n",
       " 'load_schema_text',\n",
       " 'load_sharded_statistics',\n",
       " 'load_statistics',\n",
       " 'load_stats_binary',\n",
       " 'load_stats_text',\n",
       " 'pywrap',\n",
       " 'set_domain',\n",
       " 'skew',\n",
       " 'statistics',\n",
       " 'types',\n",
       " 'update_schema',\n",
       " 'utils',\n",
       " 'validate_corresponding_slices',\n",
       " 'validate_examples_in_csv',\n",
       " 'validate_examples_in_tfrecord',\n",
       " 'validate_statistics',\n",
       " 'version',\n",
       " 'visualize_statistics',\n",
       " 'write_anomalies_text',\n",
       " 'write_schema_text',\n",
       " 'write_stats_text']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check methods present in tfdv\n",
    "# TODO: Your code goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Describing data with TFDV\n",
    "The usual workflow when using TFDV during training is as follows:\n",
    "\n",
    "\n",
    "1.   Generate statistics for the data\n",
    "2.   Use those statistics to generate a schema for each feature\n",
    "3.   Visualize the schema and statistics and manually inspect them\n",
    "4.   Update the schema if needed\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute and visualize statistics\n",
    "\n",
    "First we'll use [`tfdv.generate_statistics_from_csv`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_csv) to compute statistics for our training data. (ignore the snappy warnings)\n",
    "\n",
    "TFDV can compute descriptive [statistics](https://github.com/tensorflow/metadata/blob/v0.6.0/tensorflow_metadata/proto/v0/statistics.proto) that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions.\n",
    "\n",
    "Internally, TFDV uses [Apache Beam](https://beam.apache.org/)'s data-parallel processing framework to scale the computation of statistics over large datasets. For applications that wish to integrate deeper with TFDV (e.g., attach statistics generation at the end of a data-generation pipeline), the API also exposes a Beam PTransform for statistics generation.\n",
    "\n",
    "**NOTE:  Compute statistics**\n",
    "* [tfdv.generate_statistics_from_csv](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_csv)\n",
    "* [tfdv.generate_statistics_from_dataframe](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_dataframe)\n",
    "* [tfdv.generate_statistics_from_tfrecord](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_tfrecord)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generate Statistics from a Pandas DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute data statistics for the input pandas DataFrame.\n",
    "# TODO: Your code goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's use [`tfdv.visualize_statistics`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics), which uses [Facets](https://pair-code.github.io/facets/) to create a succinct visualization of our training data:\n",
    "\n",
    "* Notice that numeric features and categorical features are visualized separately, and that charts are displayed showing the distributions for each feature.\n",
    "* Notice that features with missing or zero values display a percentage in red as a visual indicator that there may be issues with examples in those features.  The percentage is the percentage of examples that have missing or zero values for that feature.\n",
    "* Notice that there are no examples with values for `pickup_census_tract`.  This is an opportunity for dimensionality reduction!\n",
    "* Try clicking \"expand\" above the charts to change the display\n",
    "* Try hovering over bars in the charts to display bucket ranges and counts\n",
    "* Try switching between the log and linear scales, and notice how the log scale reveals much more detail about the `payment_type` categorical feature\n",
    "* Try selecting \"quantiles\" from the \"Chart to show\" menu, and hover over the markers to show the quantile percentages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id='facets-iframe' width=\"100%\" height=\"500px\"></iframe>\n",
       "        <script>\n",
       "        facets_iframe = document.getElementById('facets-iframe');\n",
       "        facets_html = '<script src=\"https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js\"><\\/script><link rel=\"import\" href=\"https://raw.githubusercontent.com/PAIR-code/facets/master/facets-dist/facets-jupyter.html\"><facets-overview proto-input=\"CscbCg5saHNfc3RhdGlzdGljcxCgHxqYAxACIoYDCrgCCPweECQYASABLQAAgD8ypAIaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZsZ4QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmxnhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmbGeEAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZsZ4QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmxnhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmbGeEAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZsZ4QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmxnhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmbGeEAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZsZ4QCABQPweEAIaDhIDWWVzGQAAAAAAAqNAGg0SAk5vGQAAAAAA7JdAJRNIJ0AqIwoOIgNZZXMpAAAAAAACo0AKEQgBEAEiAk5vKQAAAAAA7JdAQgsKCUdyYWR1YXRlZBr6BRACIucFCrgCCOgeEDgYASABLQAAgD8ypAIaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZqZ4QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmpnhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmameEAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZqZ4QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmpnhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmameEAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZqZ4QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmpnhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmameEAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZqZ4QCABQOgeEAkaERIGQXJ0aXN0GQAAAAAACJNAGhUSCkhlYWx0aGNhcmUZAAAAAADQhEAaGBINRW50ZXJ0YWlubWVudBkAAAAAAMB9QBoTEghFbmdpbmVlchkAAAAAAPB1QBoREgZEb2N0b3IZAAAAAACAdUAaERIGTGF3eWVyGQAAAAAAgHNAGhQSCUV4ZWN1dGl2ZRkAAAAAAMByQBoUEglNYXJrZXRpbmcZAAAAAABgY0AaFBIJSG9tZW1ha2VyGQAAAAAAgF5AJdoxAkEq4QEKESIGQXJ0aXN0KQAAAAAACJNAChkIARABIgpIZWFsdGhjYXJlKQAAAAAA0IRAChwIAhACIg1FbnRlcnRhaW5tZW50KQAAAAAAwH1AChcIAxADIghFbmdpbmVlcikAAAAAAPB1QAoVCAQQBCIGRG9jdG9yKQAAAAAAgHVAChUIBRAFIgZMYXd5ZXIpAAAAAACAc0AKGAgGEAYiCUV4ZWN1dGl2ZSkAAAAAAMByQAoYCAcQByIJTWFya2V0aW5nKQAAAAAAYGNAChgICBAIIglIb21lbWFrZXIpAAAAAACAXkBCDAoKUHJvZmVzc2lvbhr/BhABGucGCrkCCIUcEJsDGAEgAS0AAIA/MqQCGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmZudkAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZm52QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmbnZAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmZudkAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZm52QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmbnZAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmZudkAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZm52QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmbnZAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmZudkAgAUCFHBFz8JGatTYFQBn2B0/ZA1wLQCDsCDEAAAAAAADwPzkAAAAAAAAsQEKZAhoSEWZmZmZmZvY/ISfqoA7qDKJAGhsJZmZmZmZm9j8RZmZmZmZmBkAhrfiKr/hKY0AaGwlmZmZmZmYGQBHMzMzMzMwQQCHNC73QC/1uQBobCczMzMzMzBBAEWZmZmZmZhZAITSM9pvyXFdAGhsJZmZmZmZmFkARAAAAAAAAHEAhdRSuR+GaakAaGwkAAAAAAAAcQBHMzMzMzMwgQCG99KrNNW5sQBobCczMzMzMzCBAEZmZmZmZmSNAITwhtCPg0GpAGhsJmZmZmZmZI0ARZmZmZmZmJkAhog7qoA7aSUAaGwlmZmZmZmYmQBEzMzMzMzMpQCG5HoXrUdg8QBobCTMzMzMzMylAEQAAAAAAACxAIT0K16NwvUdAQuUBGgkhVVVVVVWVd0AaCSFVVVVVVZV3QBoJIVVVVVVVlXdAGhIRAAAAAAAA8D8hVVVVVVWFeEAaGwkAAAAAAADwPxEAAAAAAADwPyFVVVVVVYV4QBobCQAAAAAAAPA/EQAAAAAAAPA/IVVVVVVVhXhAGhsJAAAAAAAA8D8RAAAAAAAACEAhAAAAAAAAcUAaGwkAAAAAAAAIQBEAAAAAAAAYQCEAAAAAALB0QBobCQAAAAAAABhAEQAAAAAAACBAIQAAAAAA0HRAGhsJAAAAAAAAIEARAAAAAAAALEAhAAAAAACAdUAgAUIRCg9Xb3JrX0V4cGVyaWVuY2UayQcQARq1Bwq5Agj3HRCpARgBIAEtAACAPzKkAhobCQAAAAAAAPA/EQAAAAAAAPA/IZqZmZmZ8XdAGhsJAAAAAAAA8D8RAAAAAAAA8D8hmpmZmZnxd0AaGwkAAAAAAADwPxEAAAAAAADwPyGamZmZmfF3QBobCQAAAAAAAPA/EQAAAAAAAPA/IZqZmZmZ8XdAGhsJAAAAAAAA8D8RAAAAAAAA8D8hmpmZmZnxd0AaGwkAAAAAAADwPxEAAAAAAADwPyGamZmZmfF3QBobCQAAAAAAAPA/EQAAAAAAAPA/IZqZmZmZ8XdAGhsJAAAAAAAA8D8RAAAAAAAA8D8hmpmZmZnxd0AaGwkAAAAAAADwPxEAAAAAAADwPyGamZmZmfF3QBobCQAAAAAAAPA/EQAAAAAAAPA/IZqZmZmZ8XdAIAFA9x0RRHFnrHTCBkAZUAKLOaM7+D8pAAAAAAAA8D8xAAAAAAAACEA5AAAAAAAAIkBCogIaGwkAAAAAAADwPxHNzMzMzMz8PyENYBVHe3CGQBobCc3MzMzMzPw/Ec3MzMzMzARAIfisIMn6dJJAGhsJzczMzMzMBEARNDMzMzMzC0AhJo+pp9KBh0AaGwk0MzMzMzMLQBHNzMzMzMwQQCFJolE322mFQBobCc3MzMzMzBBAEQAAAAAAABRAIe4oXI/CE3NAGhsJAAAAAAAAFEARNDMzMzMzF0AhAH9qvHSTCEAaGwk0MzMzMzMXQBFnZmZmZmYaQCHJjkgYR8dXQBobCWdmZmZmZhpAEZqZmZmZmR1AIVIHdVAHNUlAGhsJmpmZmZmZHUARZmZmZmZmIEAhHoXrUbgeN0AaGwlmZmZmZmYgQBEAAAAAAAAiQCFI4XoUrkcxQEKkAhobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAAWIZAGhsJAAAAAAAA8D8RAAAAAAAAAEAhAAAAAACgeEAaGwkAAAAAAAAAQBEAAAAAAAAAQCEAAAAAAKB4QBobCQAAAAAAAABAEQAAAAAAAABAIQAAAAAAoHhAGhsJAAAAAAAAAEARAAAAAAAACEAhAAAAAACId0AaGwkAAAAAAAAIQBEAAAAAAAAIQCEAAAAAAIh3QBobCQAAAAAAAAhAEQAAAAAAABBAIQAAAAAAcHVAGhsJAAAAAAAAEEARAAAAAAAAEEAhAAAAAABwdUAaGwkAAAAAAAAQQBEAAAAAAAAUQCEAAAAAACBzQBobCQAAAAAAABRAEQAAAAAAACJAIQAAAAAAoGdAIAFCDQoLRmFtaWx5X1NpemUaywMQAiK0Awq2AgigHxgBIAEtAACAPzKkAhobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAAAHlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAAAAeUAaGwkAAAAAAADwPxEAAAAAAADwPyEAAAAAAAB5QBobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAAAHlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAAAAeUAaGwkAAAAAAADwPxEAAAAAAADwPyEAAAAAAAB5QBobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAAAHlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAAAAeUAaGwkAAAAAAADwPxEAAAAAAADwPyEAAAAAAAB5QBobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAAAHlAIAFAoB8QAxoOEgNMb3cZAAAAAAAgo0AaEhIHQXZlcmFnZRkAAAAAAEiOQBoPEgRIaWdoGQAAAAAAOIJAJQisg0AqPQoOIgNMb3cpAAAAAAAgo0AKFggBEAEiB0F2ZXJhZ2UpAAAAAABIjkAKEwgCEAIiBEhpZ2gpAAAAAAA4gkBCEAoOU3BlbmRpbmdfU2NvcmU=\"></facets-overview>';\n",
       "        facets_iframe.srcdoc = facets_html;\n",
       "         facets_iframe.id = \"\";\n",
       "         setTimeout(() => {\n",
       "           facets_iframe.setAttribute('height', facets_iframe.contentWindow.document.body.offsetHeight + 'px')\n",
       "         }, 1500)\n",
       "         </script>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Visualize the input statistics using Facets.\n",
    "# TODO: Your code goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### TFDV generates different types of statistics based on the type of features.\n",
    "\n",
    "**For numerical features, TFDV computes for every feature:**\n",
    "* Count of records\n",
    "* Number of missing (i.e. null values)\n",
    "* Histogram of values\n",
    "* Mean and standard deviation\n",
    "* Minimum and maximum values\n",
    "* Percentage of zero values\n",
    "\n",
    "**For categorical features, TFDV provides:**\n",
    "* Count of values\n",
    "* Percentage of missing values\n",
    "* Number of unique values\n",
    "* Average string length\n",
    "* Count for each label and its rank"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's compare the score_train and the score_test datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id='facets-iframe' width=\"100%\" height=\"500px\"></iframe>\n",
       "        <script>\n",
       "        facets_iframe = document.getElementById('facets-iframe');\n",
       "        facets_html = '<script src=\"https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js\"><\\/script><link rel=\"import\" href=\"https://raw.githubusercontent.com/PAIR-code/facets/master/facets-dist/facets-jupyter.html\"><facets-overview proto-input=\"CsYbCg1UUkFJTl9EQVRBU0VUEKAfGpgDEAIihgMKuAII/B4QJBgBIAEtAACAPzKkAhobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmxnhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmbGeEAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZsZ4QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmxnhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmbGeEAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZsZ4QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmxnhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmbGeEAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZsZ4QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmxnhAIAFA/B4QAhoOEgNZZXMZAAAAAAACo0AaDRICTm8ZAAAAAADsl0AlE0gnQCojCg4iA1llcykAAAAAAAKjQAoRCAEQASICTm8pAAAAAADsl0BCCwoJR3JhZHVhdGVkGvoFEAIi5wUKuAII6B4QOBgBIAEtAACAPzKkAhobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmpnhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmameEAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZqZ4QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmpnhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmameEAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZqZ4QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmpnhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmameEAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZqZ4QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmpnhAIAFA6B4QCRoREgZBcnRpc3QZAAAAAAAIk0AaFRIKSGVhbHRoY2FyZRkAAAAAANCEQBoYEg1FbnRlcnRhaW5tZW50GQAAAAAAwH1AGhMSCEVuZ2luZWVyGQAAAAAA8HVAGhESBkRvY3RvchkAAAAAAIB1QBoREgZMYXd5ZXIZAAAAAACAc0AaFBIJRXhlY3V0aXZlGQAAAAAAwHJAGhQSCU1hcmtldGluZxkAAAAAAGBjQBoUEglIb21lbWFrZXIZAAAAAACAXkAl2jECQSrhAQoRIgZBcnRpc3QpAAAAAAAIk0AKGQgBEAEiCkhlYWx0aGNhcmUpAAAAAADQhEAKHAgCEAIiDUVudGVydGFpbm1lbnQpAAAAAADAfUAKFwgDEAMiCEVuZ2luZWVyKQAAAAAA8HVAChUIBBAEIgZEb2N0b3IpAAAAAACAdUAKFQgFEAUiBkxhd3llcikAAAAAAIBzQAoYCAYQBiIJRXhlY3V0aXZlKQAAAAAAwHJAChgIBxAHIglNYXJrZXRpbmcpAAAAAABgY0AKGAgIEAgiCUhvbWVtYWtlcikAAAAAAIBeQEIMCgpQcm9mZXNzaW9uGv8GEAEa5wYKuQIIhRwQmwMYASABLQAAgD8ypAIaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZm52QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmbnZAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmZudkAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZm52QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmbnZAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmZudkAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZm52QBobCQAAAAAAAPA/EQAAAAAAAPA/IWZmZmZmbnZAGhsJAAAAAAAA8D8RAAAAAAAA8D8hZmZmZmZudkAaGwkAAAAAAADwPxEAAAAAAADwPyFmZmZmZm52QCABQIUcEXPwkZq1NgVAGfYHT9kDXAtAIOwIMQAAAAAAAPA/OQAAAAAAACxAQpkCGhIRZmZmZmZm9j8hJ+qgDuoMokAaGwlmZmZmZmb2PxFmZmZmZmYGQCGt+Iqv+EpjQBobCWZmZmZmZgZAEczMzMzMzBBAIc0LvdAL/W5AGhsJzMzMzMzMEEARZmZmZmZmFkAhNIz2m/JcV0AaGwlmZmZmZmYWQBEAAAAAAAAcQCF1FK5H4ZpqQBobCQAAAAAAABxAEczMzMzMzCBAIb30qs01bmxAGhsJzMzMzMzMIEARmZmZmZmZI0AhPCG0I+DQakAaGwmZmZmZmZkjQBFmZmZmZmYmQCGiDuqgDtpJQBobCWZmZmZmZiZAETMzMzMzMylAIbkehetR2DxAGhsJMzMzMzMzKUARAAAAAAAALEAhPQrXo3C9R0BC5QEaCSFVVVVVVZV3QBoJIVVVVVVVlXdAGgkhVVVVVVWVd0AaEhEAAAAAAADwPyFVVVVVVYV4QBobCQAAAAAAAPA/EQAAAAAAAPA/IVVVVVVVhXhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hVVVVVVWFeEAaGwkAAAAAAADwPxEAAAAAAAAIQCEAAAAAAABxQBobCQAAAAAAAAhAEQAAAAAAABhAIQAAAAAAsHRAGhsJAAAAAAAAGEARAAAAAAAAIEAhAAAAAADQdEAaGwkAAAAAAAAgQBEAAAAAAAAsQCEAAAAAAIB1QCABQhEKD1dvcmtfRXhwZXJpZW5jZRrJBxABGrUHCrkCCPcdEKkBGAEgAS0AAIA/MqQCGhsJAAAAAAAA8D8RAAAAAAAA8D8hmpmZmZnxd0AaGwkAAAAAAADwPxEAAAAAAADwPyGamZmZmfF3QBobCQAAAAAAAPA/EQAAAAAAAPA/IZqZmZmZ8XdAGhsJAAAAAAAA8D8RAAAAAAAA8D8hmpmZmZnxd0AaGwkAAAAAAADwPxEAAAAAAADwPyGamZmZmfF3QBobCQAAAAAAAPA/EQAAAAAAAPA/IZqZmZmZ8XdAGhsJAAAAAAAA8D8RAAAAAAAA8D8hmpmZmZnxd0AaGwkAAAAAAADwPxEAAAAAAADwPyGamZmZmfF3QBobCQAAAAAAAPA/EQAAAAAAAPA/IZqZmZmZ8XdAGhsJAAAAAAAA8D8RAAAAAAAA8D8hmpmZmZnxd0AgAUD3HRFEcWesdMIGQBlQAos5ozv4PykAAAAAAADwPzEAAAAAAAAIQDkAAAAAAAAiQEKiAhobCQAAAAAAAPA/Ec3MzMzMzPw/IQ1gFUd7cIZAGhsJzczMzMzM/D8RzczMzMzMBEAh+Kwgyfp0kkAaGwnNzMzMzMwEQBE0MzMzMzMLQCEmj6mn0oGHQBobCTQzMzMzMwtAEc3MzMzMzBBAIUmiUTfbaYVAGhsJzczMzMzMEEARAAAAAAAAFEAh7ihcj8ITc0AaGwkAAAAAAAAUQBE0MzMzMzMXQCEAf2q8dJMIQBobCTQzMzMzMxdAEWdmZmZmZhpAIcmOSBhHx1dAGhsJZ2ZmZmZmGkARmpmZmZmZHUAhUgd1UAc1SUAaGwmamZmZmZkdQBFmZmZmZmYgQCEehetRuB43QBobCWZmZmZmZiBAEQAAAAAAACJAIUjhehSuRzFAQqQCGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAABYhkAaGwkAAAAAAADwPxEAAAAAAAAAQCEAAAAAAKB4QBobCQAAAAAAAABAEQAAAAAAAABAIQAAAAAAoHhAGhsJAAAAAAAAAEARAAAAAAAAAEAhAAAAAACgeEAaGwkAAAAAAAAAQBEAAAAAAAAIQCEAAAAAAIh3QBobCQAAAAAAAAhAEQAAAAAAAAhAIQAAAAAAiHdAGhsJAAAAAAAACEARAAAAAAAAEEAhAAAAAABwdUAaGwkAAAAAAAAQQBEAAAAAAAAQQCEAAAAAAHB1QBobCQAAAAAAABBAEQAAAAAAABRAIQAAAAAAIHNAGhsJAAAAAAAAFEARAAAAAAAAIkAhAAAAAACgZ0AgAUINCgtGYW1pbHlfU2l6ZRrLAxACIrQDCrYCCKAfGAEgAS0AAIA/MqQCGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAAAAeUAaGwkAAAAAAADwPxEAAAAAAADwPyEAAAAAAAB5QBobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAAAHlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAAAAeUAaGwkAAAAAAADwPxEAAAAAAADwPyEAAAAAAAB5QBobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAAAHlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAAAAeUAaGwkAAAAAAADwPxEAAAAAAADwPyEAAAAAAAB5QBobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAAAHlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAAAAeUAgAUCgHxADGg4SA0xvdxkAAAAAACCjQBoSEgdBdmVyYWdlGQAAAAAASI5AGg8SBEhpZ2gZAAAAAAA4gkAlCKyDQCo9Cg4iA0xvdykAAAAAACCjQAoWCAEQASIHQXZlcmFnZSkAAAAAAEiOQAoTCAIQAiIESGlnaCkAAAAAADiCQEIQCg5TcGVuZGluZ19TY29yZQrEGwoLTkVXX0RBVEFTRVQQ5B8amAMQAiKGAwq4Agi6HxAqGAEgAS0AAIA/MqQCGhsJAAAAAAAA8D8RAAAAAAAA8D8hmpmZmZkpeUAaGwkAAAAAAADwPxEAAAAAAADwPyGamZmZmSl5QBobCQAAAAAAAPA/EQAAAAAAAPA/IZqZmZmZKXlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hmpmZmZkpeUAaGwkAAAAAAADwPxEAAAAAAADwPyGamZmZmSl5QBobCQAAAAAAAPA/EQAAAAAAAPA/IZqZmZmZKXlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hmpmZmZkpeUAaGwkAAAAAAADwPxEAAAAAAADwPyGamZmZmSl5QBobCQAAAAAAAPA/EQAAAAAAAPA/IZqZmZmZKXlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hmpmZmZkpeUAgAUC6HxACGg4SA1llcxkAAAAAAM6jQBoNEgJObxkAAAAAAEyXQCVOTChAKiMKDiIDWWVzKQAAAAAAzqNAChEIARABIgJObykAAAAAAEyXQEILCglHcmFkdWF0ZWQa+gUQAiLnBQq4AgigHxBEGAEgAS0AAIA/MqQCGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAAAAeUAaGwkAAAAAAADwPxEAAAAAAADwPyEAAAAAAAB5QBobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAAAHlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAAAAeUAaGwkAAAAAAADwPxEAAAAAAADwPyEAAAAAAAB5QBobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAAAHlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAAAAeUAaGwkAAAAAAADwPxEAAAAAAADwPyEAAAAAAAB5QBobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAAAHlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAAAAeUAgAUCgHxAJGhESBkFydGlzdBkAAAAAAEiUQBoVEgpIZWFsdGhjYXJlGQAAAAAA0IRAGhgSDUVudGVydGFpbm1lbnQZAAAAAACQfUAaExIIRW5naW5lZXIZAAAAAADAdUAaERIGRG9jdG9yGQAAAAAAgHVAGhESBkxhd3llchkAAAAAAHBzQBoUEglFeGVjdXRpdmUZAAAAAACwckAaFBIJTWFya2V0aW5nGQAAAAAAIGFAGhQSCUhvbWVtYWtlchkAAAAAAABfQCVtZwFBKuEBChEiBkFydGlzdCkAAAAAAEiUQAoZCAEQASIKSGVhbHRoY2FyZSkAAAAAANCEQAocCAIQAiINRW50ZXJ0YWlubWVudCkAAAAAAJB9QAoXCAMQAyIIRW5naW5lZXIpAAAAAADAdUAKFQgEEAQiBkRvY3RvcikAAAAAAIB1QAoVCAUQBSIGTGF3eWVyKQAAAAAAcHNAChgIBhAGIglFeGVjdXRpdmUpAAAAAACwckAKGAgHEAciCU1hcmtldGluZykAAAAAACBhQAoYCAgQCCIJSG9tZW1ha2VyKQAAAAAAAF9AQgwKClByb2Zlc3Npb24a/wYQARrnBgq5AgjCHBCiAxgBIAEtAACAPzKkAhobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAA0HZAGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAADQdkAaGwkAAAAAAADwPxEAAAAAAADwPyEAAAAAANB2QBobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAA0HZAGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAADQdkAaGwkAAAAAAADwPxEAAAAAAADwPyEAAAAAANB2QBobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAA0HZAGhsJAAAAAAAA8D8RAAAAAAAA8D8hAAAAAADQdkAaGwkAAAAAAADwPxEAAAAAAADwPyEAAAAAANB2QBobCQAAAAAAAPA/EQAAAAAAAPA/IQAAAAAA0HZAIAFAwhwREQSDGuMNBUAZZj/bdWolC0AgogkxAAAAAAAA8D85AAAAAAAALEBCmQIaEhFmZmZmZmb2PyGP9Emf9HiiQBobCWZmZmZmZvY/EWZmZmZmZgZAIcTbrghVzGBAGhsJZmZmZmZmBkARzMzMzMzMEEAhsiq5d7H7b0AaGwnMzMzMzMwQQBFmZmZmZmYWQCHxJ3/yJ99ZQBobCWZmZmZmZhZAEQAAAAAAABxAIZQbuZEb2WZAGhsJAAAAAAAAHEARzMzMzMzMIEAhHIGPsw3PbUAaGwnMzMzMzMwgQBGZmZmZmZkjQCGJnik82GlwQBobCZmZmZmZmSNAEWZmZmZmZiZAIT5ZLW+MYEdAGhsJZmZmZmZmJkARMzMzMzMzKUAhc+6zC4hNNkAaGwkzMzMzMzMpQBEAAAAAAAAsQCHdtm3btm1DQELlARoJIVVVVVVVtXhAGgkhVVVVVVW1eEAaCSFVVVVVVbV4QBoSEQAAAAAAAPA/IVVVVVVVhXhAGhsJAAAAAAAA8D8RAAAAAAAA8D8hVVVVVVWFeEAaGwkAAAAAAADwPxEAAAAAAADwPyFVVVVVVYV4QBobCQAAAAAAAPA/EQAAAAAAAAhAIQAAAAAA0HBAGhsJAAAAAAAACEARAAAAAAAAGEAhAAAAAAAAdEAaGwkAAAAAAAAYQBEAAAAAAAAiQCEAAAAAAFiCQBobCQAAAAAAACJAEQAAAAAAACxAIQAAAAAAwFtAIAFCEQoPV29ya19FeHBlcmllbmNlGskHEAEatQcKuQIIvh4QpgEYASABLQAAgD8ypAIaGwkAAAAAAADwPxEAAAAAAADwPyEzMzMzM2N4QBobCQAAAAAAAPA/EQAAAAAAAPA/ITMzMzMzY3hAGhsJAAAAAAAA8D8RAAAAAAAA8D8hMzMzMzNjeEAaGwkAAAAAAADwPxEAAAAAAADwPyEzMzMzM2N4QBobCQAAAAAAAPA/EQAAAAAAAPA/ITMzMzMzY3hAGhsJAAAAAAAA8D8RAAAAAAAA8D8hMzMzMzNjeEAaGwkAAAAAAADwPxEAAAAAAADwPyEzMzMzM2N4QBobCQAAAAAAAPA/EQAAAAAAAPA/ITMzMzMzY3hAGhsJAAAAAAAA8D8RAAAAAAAA8D8hMzMzMzNjeEAaGwkAAAAAAADwPxEAAAAAAADwPyEzMzMzM2N4QCABQL4eEbXHMmN01wZAGaZ0VTTZwvg/KQAAAAAAAPA/MQAAAAAAAAhAOQAAAAAAACJAQqICGhsJAAAAAAAA8D8RzczMzMzM/D8haTn2IgUph0AaGwnNzMzMzMz8PxHNzMzMzMwEQCEQyFay1tySQBobCc3MzMzMzARAETQzMzMzMwtAIbyFr9jUOYdAGhsJNDMzMzMzC0ARzczMzMzMEEAhO+31LKuhhUAaGwnNzMzMzMwQQBEAAAAAAAAUQCF3h23lmhNzQBobCQAAAAAAABRAETQzMzMzMxdAIcVY8oslvwhAGhsJNDMzMzMzF0ARZ2ZmZmZmGkAhkeFxGR7XXEAaGwlnZmZmZmYaQBGamZmZmZkdQCEpQxtX/QNGQBobCZqZmZmZmR1AEWZmZmZmZiBAIR3UQR3UQTlAGhsJZmZmZmZmIEARAAAAAAAAIkAhH9RBHdRBOUBCpAIaGwkAAAAAAADwPxEAAAAAAADwPyEAAAAAABCHQBobCQAAAAAAAPA/EQAAAAAAAABAIauqqqqqKnlAGhsJAAAAAAAAAEARAAAAAAAAAEAhq6qqqqoqeUAaGwkAAAAAAAAAQBEAAAAAAAAAQCGrqqqqqip5QBobCQAAAAAAAABAEQAAAAAAAAhAIQAAAAAAQHdAGhsJAAAAAAAACEARAAAAAAAACEAhAAAAAABAd0AaGwkAAAAAAAAIQBEAAAAAAAAQQCEAAAAAAKh1QBobCQAAAAAAABBAEQAAAAAAABBAIQAAAAAAqHVAGhsJAAAAAAAAEEARAAAAAAAAFEAhAAAAAAAgc0AaGwkAAAAAAAAUQBEAAAAAAAAiQCEAAAAAAKBqQCABQg0KC0ZhbWlseV9TaXplGssDEAIitAMKtgII5B8YASABLQAAgD8ypAIaGwkAAAAAAADwPxEAAAAAAADwPyHNzMzMzGx5QBobCQAAAAAAAPA/EQAAAAAAAPA/Ic3MzMzMbHlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hzczMzMxseUAaGwkAAAAAAADwPxEAAAAAAADwPyHNzMzMzGx5QBobCQAAAAAAAPA/EQAAAAAAAPA/Ic3MzMzMbHlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hzczMzMxseUAaGwkAAAAAAADwPxEAAAAAAADwPyHNzMzMzGx5QBobCQAAAAAAAPA/EQAAAAAAAPA/Ic3MzMzMbHlAGhsJAAAAAAAA8D8RAAAAAAAA8D8hzczMzMxseUAaGwkAAAAAAADwPxEAAAAAAADwPyHNzMzMzGx5QCABQOQfEAMaDhIDTG93GQAAAAAA/KJAGhISB0F2ZXJhZ2UZAAAAAABoj0AaDxIESGlnaBkAAAAAAMiDQCUOmoRAKj0KDiIDTG93KQAAAAAA/KJAChYIARABIgdBdmVyYWdlKQAAAAAAaI9AChMIAhACIgRIaWdoKQAAAAAAyINAQhAKDlNwZW5kaW5nX1Njb3Jl\"></facets-overview>';\n",
       "        facets_iframe.srcdoc = facets_html;\n",
       "         facets_iframe.id = \"\";\n",
       "         setTimeout(() => {\n",
       "           facets_iframe.setAttribute('height', facets_iframe.contentWindow.document.body.offsetHeight + 'px')\n",
       "         }, 1500)\n",
       "         </script>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "train_stats = tfdv.generate_statistics_from_dataframe(dataframe=score_train)\n",
    "test_stats = tfdv.generate_statistics_from_dataframe(dataframe=score_test)\n",
    "\n",
    "tfdv.visualize_statistics(\n",
    "  lhs_statistics=train_stats, lhs_name='TRAIN_DATASET',\n",
    "  rhs_statistics=test_stats, rhs_name='NEW_DATASET')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "KVR02-y4V0uM"
   },
   "source": [
    "### Infer a schema\n",
    "\n",
    "Now let's use [`tfdv.infer_schema`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/infer_schema) to create a schema for our data.  A schema defines constraints for the data that are relevant for ML. Example constraints include the data type of each feature, whether it's numerical or categorical, or the frequency of its presence in the data.  For categorical features the schema also defines the domain - the list of acceptable values.  Since writing a schema can be a tedious task, especially for datasets with lots of features, TFDV provides a method to generate an initial version of the schema based on the descriptive statistics.\n",
    "\n",
    "Getting the schema right is important because the rest of our production pipeline will be relying on the schema that TFDV generates to be correct.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generating Schema\n",
    "Once statistics are generated, the next step is to generate a schema for our dataset. This schema will map each feature in the dataset to a type (float, bytes, etc.). Also define feature boundaries (min, max, distribution of values and missings, etc.).\n",
    "\n",
    "Link to infer schema\n",
    "https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/infer_schema\n",
    "\n",
    "With TFDV, we generate schema from statistics using"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "6LLkRJThVr9m"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "feature {\n",
      "  name: \"Graduated\"\n",
      "  value_count {\n",
      "    min: 1\n",
      "    max: 1\n",
      "  }\n",
      "  type: BYTES\n",
      "  domain: \"Graduated\"\n",
      "  presence {\n",
      "    min_count: 1\n",
      "  }\n",
      "}\n",
      "feature {\n",
      "  name: \"Profession\"\n",
      "  value_count {\n",
      "    min: 1\n",
      "    max: 1\n",
      "  }\n",
      "  type: BYTES\n",
      "  domain: \"Profession\"\n",
      "  presence {\n",
      "    min_count: 1\n",
      "  }\n",
      "}\n",
      "feature {\n",
      "  name: \"Work_Experience\"\n",
      "  value_count {\n",
      "    min: 1\n",
      "    max: 1\n",
      "  }\n",
      "  type: FLOAT\n",
      "  presence {\n",
      "    min_count: 1\n",
      "  }\n",
      "}\n",
      "feature {\n",
      "  name: \"Family_Size\"\n",
      "  value_count {\n",
      "    min: 1\n",
      "    max: 1\n",
      "  }\n",
      "  type: FLOAT\n",
      "  presence {\n",
      "    min_count: 1\n",
      "  }\n",
      "}\n",
      "feature {\n",
      "  name: \"Spending_Score\"\n",
      "  type: BYTES\n",
      "  domain: \"Spending_Score\"\n",
      "  presence {\n",
      "    min_fraction: 1.0\n",
      "    min_count: 1\n",
      "  }\n",
      "  shape {\n",
      "    dim {\n",
      "      size: 1\n",
      "    }\n",
      "  }\n",
      "}\n",
      "string_domain {\n",
      "  name: \"Graduated\"\n",
      "  value: \"No\"\n",
      "  value: \"Yes\"\n",
      "}\n",
      "string_domain {\n",
      "  name: \"Profession\"\n",
      "  value: \"Artist\"\n",
      "  value: \"Doctor\"\n",
      "  value: \"Engineer\"\n",
      "  value: \"Entertainment\"\n",
      "  value: \"Executive\"\n",
      "  value: \"Healthcare\"\n",
      "  value: \"Homemaker\"\n",
      "  value: \"Lawyer\"\n",
      "  value: \"Marketing\"\n",
      "}\n",
      "string_domain {\n",
      "  name: \"Spending_Score\"\n",
      "  value: \"Average\"\n",
      "  value: \"High\"\n",
      "  value: \"Low\"\n",
      "}\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Infers schema from the input statistics.\n",
    "# TODO: Your code goes here\n",
    "print(schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The schema also provides documentation for the data, and so is useful when different developers work on the same data.  Let's use [`tfdv.display_schema`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_schema) to display the inferred schema so that we can review it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Type</th>\n",
       "      <th>Presence</th>\n",
       "      <th>Valency</th>\n",
       "      <th>Domain</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Feature name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>'Graduated'</th>\n",
       "      <td>STRING</td>\n",
       "      <td>optional</td>\n",
       "      <td>single</td>\n",
       "      <td>'Graduated'</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Profession'</th>\n",
       "      <td>STRING</td>\n",
       "      <td>optional</td>\n",
       "      <td>single</td>\n",
       "      <td>'Profession'</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Work_Experience'</th>\n",
       "      <td>FLOAT</td>\n",
       "      <td>optional</td>\n",
       "      <td>single</td>\n",
       "      <td>-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Family_Size'</th>\n",
       "      <td>FLOAT</td>\n",
       "      <td>optional</td>\n",
       "      <td>single</td>\n",
       "      <td>-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Spending_Score'</th>\n",
       "      <td>STRING</td>\n",
       "      <td>required</td>\n",
       "      <td></td>\n",
       "      <td>'Spending_Score'</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     Type  Presence Valency            Domain\n",
       "Feature name                                                 \n",
       "'Graduated'        STRING  optional  single       'Graduated'\n",
       "'Profession'       STRING  optional  single      'Profession'\n",
       "'Work_Experience'   FLOAT  optional  single                 -\n",
       "'Family_Size'       FLOAT  optional  single                 -\n",
       "'Spending_Score'   STRING  required          'Spending_Score'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Values</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Domain</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>'Graduated'</th>\n",
       "      <td>'No', 'Yes'</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Profession'</th>\n",
       "      <td>'Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Homemaker', 'Lawyer', 'Marketing'</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Spending_Score'</th>\n",
       "      <td>'Average', 'High', 'Low'</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                                                                          Values\n",
       "Domain                                                                                                                          \n",
       "'Graduated'                                                                                                          'No', 'Yes'\n",
       "'Profession'      'Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Homemaker', 'Lawyer', 'Marketing'\n",
       "'Spending_Score'                                                                                        'Average', 'High', 'Low'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "tfdv.display_schema(schema=schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### TFDV provides a API to print a summary of each feature schema using\n",
    "\n",
    "In this visualization, the columns stand for:\n",
    "\n",
    "**Type** indicates the feature datatype.\n",
    "\n",
    "**Presence** indicates whether the feature must be present in 100% of examples (required) or not (optional).\n",
    "\n",
    "**Valency** indicates the number of values required per training example. \n",
    "\n",
    "**Domain and Values** indicates The feature domain and its values\n",
    "\n",
    "In the case of categorical features, single indicates that each training example must have exactly one category for the feature."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Updating the Schema \n",
    "As stated above, **Presence** indicates whether the feature must be present in 100% of examples (required) or not (optional).  Currently, all of our features except for our target label are shown as \"optional\". We need to make our features all required except for \"Work Experience\".  We will need to update the schema."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TFDV lets you update the schema according to your domain knowledge of the data if you are not satisfied by the auto-generated schema.  We will update three use cases:  Making a feature required, adding a value to a feature, and change a feature from a float to an integer. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Change optional features to required."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Update Family_Size from FLOAT to Int\n",
    "Graduated_feature = tfdv.get_feature(schema, 'Graduated')\n",
    "Graduated_feature.presence.min_fraction = 1.0\n",
    "Profession_feature = tfdv.get_feature(schema, 'Profession')\n",
    "Profession_feature.presence.min_fraction = 1.0\n",
    "Family_Size_feature = tfdv.get_feature(schema, 'Family_Size')\n",
    "Family_Size_feature.presence.min_fraction = 1.0\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Type</th>\n",
       "      <th>Presence</th>\n",
       "      <th>Valency</th>\n",
       "      <th>Domain</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Feature name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>'Graduated'</th>\n",
       "      <td>STRING</td>\n",
       "      <td>required</td>\n",
       "      <td>single</td>\n",
       "      <td>'Graduated'</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Profession'</th>\n",
       "      <td>STRING</td>\n",
       "      <td>required</td>\n",
       "      <td>single</td>\n",
       "      <td>'Profession'</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Work_Experience'</th>\n",
       "      <td>FLOAT</td>\n",
       "      <td>optional</td>\n",
       "      <td>single</td>\n",
       "      <td>-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Family_Size'</th>\n",
       "      <td>FLOAT</td>\n",
       "      <td>required</td>\n",
       "      <td>single</td>\n",
       "      <td>-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Spending_Score'</th>\n",
       "      <td>STRING</td>\n",
       "      <td>required</td>\n",
       "      <td></td>\n",
       "      <td>'Spending_Score'</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     Type  Presence Valency            Domain\n",
       "Feature name                                                 \n",
       "'Graduated'        STRING  required  single       'Graduated'\n",
       "'Profession'       STRING  required  single      'Profession'\n",
       "'Work_Experience'   FLOAT  optional  single                 -\n",
       "'Family_Size'       FLOAT  required  single                 -\n",
       "'Spending_Score'   STRING  required          'Spending_Score'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Values</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Domain</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>'Graduated'</th>\n",
       "      <td>'No', 'Yes'</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Profession'</th>\n",
       "      <td>'Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Homemaker', 'Lawyer', 'Marketing'</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Spending_Score'</th>\n",
       "      <td>'Average', 'High', 'Low'</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                                                                          Values\n",
       "Domain                                                                                                                          \n",
       "'Graduated'                                                                                                          'No', 'Yes'\n",
       "'Profession'      'Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Homemaker', 'Lawyer', 'Marketing'\n",
       "'Spending_Score'                                                                                        'Average', 'High', 'Low'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "tfdv.display_schema(schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Update a feature with a new value"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's add \"self-employed\" to the Profession feature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Self-Employed', 'Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Homemaker', 'Lawyer', 'Marketing']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Profession_domain = tfdv.get_domain(schema, 'Profession')\n",
    "Profession_domain.value.insert(0, 'Self-Employed')\n",
    "Profession_domain.value\n",
    "# [0 indicates I want 'Self-Employed to come first', if the number were 3, \n",
    "# it would be placed after the third value. ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Let's remove \"Homemaker\" from \"Profession\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "Profession_domain = tfdv.get_domain(schema, 'Profession')\n",
    "Profession_domain.value.remove('Homemaker')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Self-Employed', 'Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Lawyer', 'Marketing']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Profession_domain.value"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Change a feature from a float to an integer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Update Family_Size to Int\n",
    "size = tfdv.get_feature(schema, 'Family_Size')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Type</th>\n",
       "      <th>Presence</th>\n",
       "      <th>Valency</th>\n",
       "      <th>Domain</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Feature name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>'Graduated'</th>\n",
       "      <td>STRING</td>\n",
       "      <td>required</td>\n",
       "      <td>single</td>\n",
       "      <td>'Graduated'</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Profession'</th>\n",
       "      <td>STRING</td>\n",
       "      <td>required</td>\n",
       "      <td>single</td>\n",
       "      <td>'Profession'</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Work_Experience'</th>\n",
       "      <td>FLOAT</td>\n",
       "      <td>optional</td>\n",
       "      <td>single</td>\n",
       "      <td>-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Family_Size'</th>\n",
       "      <td>INT</td>\n",
       "      <td>required</td>\n",
       "      <td>single</td>\n",
       "      <td>-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Spending_Score'</th>\n",
       "      <td>STRING</td>\n",
       "      <td>required</td>\n",
       "      <td></td>\n",
       "      <td>'Spending_Score'</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     Type  Presence Valency            Domain\n",
       "Feature name                                                 \n",
       "'Graduated'        STRING  required  single       'Graduated'\n",
       "'Profession'       STRING  required  single      'Profession'\n",
       "'Work_Experience'   FLOAT  optional  single                 -\n",
       "'Family_Size'         INT  required  single                 -\n",
       "'Spending_Score'   STRING  required          'Spending_Score'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Values</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Domain</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>'Graduated'</th>\n",
       "      <td>'No', 'Yes'</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Profession'</th>\n",
       "      <td>'Self-Employed', 'Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Lawyer', 'Marketing'</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>'Spending_Score'</th>\n",
       "      <td>'Average', 'High', 'Low'</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                                                                              Values\n",
       "Domain                                                                                                                              \n",
       "'Graduated'                                                                                                              'No', 'Yes'\n",
       "'Profession'      'Self-Employed', 'Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Lawyer', 'Marketing'\n",
       "'Spending_Score'                                                                                            'Average', 'High', 'Low'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "size.type=2\n",
    "tfdv.display_schema(schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next lab, you compare two datasets and check for anomalies."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "b8eC59yISdGB"
   },
   "source": [
    "## When to use TFDV\n",
    "\n",
    "It's easy to think of TFDV as only applying to the start of your training pipeline, as we did here, but in fact it has many uses. Here are a few more:\n",
    "\n",
    "* Validating new data for inference to make sure that we haven't suddenly started receiving bad features\n",
    "* Validating new data for inference to make sure that our model has trained on that part of the decision surface\n",
    "* Validating our data after we've transformed it and done feature engineering (probably using [TensorFlow Transform](https://www.tensorflow.org/tfx/guide/transform)) to make sure we haven't done something wrong"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "https://github.com/GoogleCloudPlatform/mlops-on-gcp/blob/master/examples/tfdv-structured-data/tfdv-covertype.ipynb"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [
    "tghWegsjhpkt"
   ],
   "name": "tfdv_basic.ipynb",
   "private_outputs": true,
   "provenance": [],
   "toc_visible": true
  },
  "environment": {
   "kernel": "python3",
   "name": "tf2-gpu.2-11.m108",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-11:m108"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
